fix(alerts): Don't group on time #89447


Merged

merged 24 commits into master from ceorourke/dont-group-on-time-tsdb on Apr 21, 2025

Conversation


@ceorourke ceorourke commented Apr 11, 2025

Issue alerts with "slow" conditions over a large time period (1 day, 1 week, 1 month) can hit our 10,000-result limit when there are a lot of groups to process, because we return timestamped data per group (e.g. group id 1 has data for 12:01, 12:02, 12:03, and so on, multiplied by the number of groups we're querying). Data past the limit is dropped, which produces lower counts than expected. This can cause alerts to not fire, since the data we're receiving isn't accurate.

To fix this we disable `group_on_time`: instead of receiving all the timestamped data and summing it later, we simply receive the sum without the timestamps.

I renamed `get_sums` to `get_timeseries_sums` and pointed the usages that rely on the timeseries data at that. The places that use `get_sums` now expect to receive the aggregate data as a simple dictionary, `{group_id: count}`.
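To illustrate the shape change described above, here is a hypothetical sketch (the data and exact return shapes are made up for illustration; the real tsdb functions in Sentry's codebase differ in detail):

```python
# Timeseries shape (old get_sums, now get_timeseries_sums): one
# (timestamp, count) pair per rollup bucket per group. Each bucket is a
# separate result row, so many groups x many buckets can hit the row limit.
timeseries = {
    1: [(1_700_000_000, 3), (1_700_000_060, 2), (1_700_000_120, 4)],
    2: [(1_700_000_000, 1), (1_700_000_060, 5)],
}

# Summing client-side yields the per-group totals the alert rules need.
totals_from_timeseries = {
    group_id: sum(count for _, count in series)
    for group_id, series in timeseries.items()
}

# Aggregate shape (get_sums with group_on_time disabled): one row per
# group, so the result size scales with group count, not groups x buckets.
totals = {1: 9, 2: 6}

assert totals_from_timeseries == totals
```

The totals are identical either way; the aggregate form just can't lose rows to the limit in the same way.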

referrer_suffix=referrer_suffix,
)
result: Mapping[int, int] = {}
if tsdb_function == tsdb.get_sums:
Member Author:

I didn't want to add group_on_time to the other tsdb functions we may pass since they don't use it anyway, but this feels a little hacky.

Member:

One option could be to use functools.partial(tsdb.get_sums, group_on_time=True) when you pass it, instead of passing it through here.

I think either that, or adding group_on_time as a param to all funcs is the right move
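The `functools.partial` suggestion could look roughly like this sketch (function names and signatures here are illustrative stand-ins, not Sentry's actual APIs):

```python
from functools import partial

def get_sums(model, keys, start, end, group_on_time=False):
    # Stand-in for tsdb.get_sums; returns {key: count} when
    # group_on_time is False, per the change described above.
    return {key: 0 for key in keys}

def run_query(tsdb_function, model, keys, start, end):
    # The generic query path no longer inspects which function it was
    # handed, so the `if tsdb_function == tsdb.get_sums:` check goes away.
    return tsdb_function(model, keys, start, end)

# The caller binds the extra keyword argument up front:
bound = partial(get_sums, group_on_time=True)
result = run_query(bound, "events", [1, 2], 0, 60)
```

This keeps the special-casing at the call site that knows it wants timeseries behavior, rather than inside the shared query helper.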

Member

@wedamija wedamija left a comment


You'll need to do this for get_distinct_counts_totals too

Comment on lines 216 to 217
group_1_id: 5,
group_2_id: 5,
Member:

Why are the counts 5 here, when we created only 4 events per group?

Member Author:

Oh this is because the base setup creates events with the same fingerprint, I'll update the fingerprint to be different
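A toy illustration of the fingerprint issue discussed above (the fingerprints and counts are made up, not the actual test data): when every event shares one fingerprint, they all collapse into a single group, so per-group counts come out higher than the events-per-group the test intended.

```python
from collections import Counter

def count_per_group(events):
    # Events with the same fingerprint land in the same group.
    return Counter(fingerprint for fingerprint, _ in events)

same = [("group-a", i) for i in range(8)]          # one shared fingerprint
distinct = [("group-a", i) for i in range(4)] + [  # two distinct fingerprints
    ("group-b", i) for i in range(4)
]

assert count_per_group(same) == {"group-a": 8}
assert count_per_group(distinct) == {"group-a": 4, "group-b": 4}
```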


codecov bot commented Apr 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #89447   +/-   ##
=======================================
  Coverage   87.71%   87.71%           
=======================================
  Files       10180    10180           
  Lines      574508   574541   +33     
  Branches    22632    22632           
=======================================
+ Hits       503932   503969   +37     
+ Misses      70148    70144    -4     
  Partials      428      428           

@ceorourke ceorourke marked this pull request as ready for review April 15, 2025 22:59
@ceorourke ceorourke requested review from a team as code owners April 15, 2025 22:59
@ceorourke ceorourke requested a review from wedamija April 15, 2025 23:13

@wedamija wedamija left a comment


lgtm, let's merge it when we can be around to watch the deploy though

@ceorourke ceorourke merged commit 37cef4e into master Apr 21, 2025
64 checks passed
@ceorourke ceorourke deleted the ceorourke/dont-group-on-time-tsdb branch April 21, 2025 17:26
andrewshie-sentry pushed a commit that referenced this pull request Apr 22, 2025

sentry-io bot commented Apr 23, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ SnubaError: HTTPConnectionPool(host='snuba-api', port=80): Read timed out. (read timeout=30) sentry.tasks.digests.deliver_digest View Issue
  • ‼️ RateLimitExceeded: Query on could not be run due to allocation policies, info: {'details': {'ConcurrentRateLimitAllocationPolicy': {'can_run': False, 'max_threads': 0, 'explanation': {'reason': 'concurrent policy 17 exceeds limit of 16', 'overrides': {}, 'storage_key': 'S... sentry.tasks.digests.deliver_digest View Issue


@github-actions github-actions bot locked and limited conversation to collaborators May 9, 2025
Labels
Scope: Backend Automatically applied to PRs that change backend components